Exploring gene content with pangene graphs

Abstract Motivation The gene content regulates the biology of an organism. It varies between species and between individuals of the same species. Although tools have been developed to identify gene content changes in bacterial genomes, none is applicable to collections of large eukaryotic genomes such as the human pangenome. Results We developed pangene, a computational tool to identify gene orientation, gene order, and gene copy-number changes in a collection of genomes. Pangene aligns a set of input protein sequences to the genomes, resolves redundancies between protein sequences and constructs a gene graph with each genome represented as a walk in the graph. It additionally finds subgraphs, which we call bibubbles, that capture gene content changes. Applied to the human pangenome, pangene identifies known gene-level variations and reveals complex haplotypes that are not well studied before. Pangene also works with high-quality bacterial pangenome and reports similar numbers of core and accessory genes in comparison to existing tools. Availability and implementation Source code at https://github.com/lh3/pangene; prebuilt pangene graphs can be downloaded from https://zenodo.org/records/8118576 and visualized at https://pangene.bioinweb.org


Introduction
A human genome contains about 20,000 protein-coding genes.A small number of them have frequent copy-number or gene order changes in the human population (Sudmant et al., 2010;Handsaker et al., 2015).These genes are under fast evolution and may be responsible for immune responses, affecting brain functionality (Ju et al., 2016) and drug metabolism (Taylor et al., 2020), or associated with known diseases (Mercuri et al., 2022).They may have profound biological and biomedical implications.
Thanks to the recent advances in sequencing technologies (Wenger et al., 2019) and assembly algorithms (Nurk et al., 2020;Cheng et al., 2021;Rautiainen et al., 2023), we can routinely achieve haplotype-resolved assembly over genes under copy-number or order changes.We have also developed algorithms to construct pangenome sequence graphs that encode variations between genomes.However, how to identify these gene-level variations is not straightforward.Among the three whole-genome pangenome construction tools used by the Human Pangenome Reference Consortium (HPRC), minigraph (Li et al., 2020) and minigraph-cactus (Hickey et al., 2023) are unable to align through complex genomic regions and may miss genes in long segmental duplications; PGGB (Garrison et al., 2023) collapses paralogous genes which makes it difficult to study individual paralogs.In addition, all three tools do not directly reveal how genomic variations affect genes.To study gene-level variations, HPRC had to manually annotate genes on each haplotype (Liao et al., 2023) which is a time-consuming process.PGR-TK (Chin et al., 2023) reconstructs local haplotype structures from genomic sequences but it is does not directly model genes and is not intended for whole-genome data.The current human pangenome tooling is not designed for studying gene variations.
In contrast, research on bacterial pangenome focuses on proteincoding genes instead of genome sequences, to the point that in the literature, a bacterial "pangenome" often refers to the collection of protein-coding genes.Several high-quality tools have been developed for constructing the gene content of bacterial genomes (Page et al., 2015;Ding et al., 2018;Tonkin-Hill et al., 2020;Gautreau et al., 2020;Zhou et al., 2020).In a nutshell, they start with ab initio gene annotation in each genome, cluster the resulted protein sequences, and then post-process clusters to identify orthologous genes and to fix issues caused by imperfect assembly, annotation or clustering (Tonkin-Hill et al., 2023).These bacterial pangenome tools however have not considered splicing, multiple isoforms, frequent segmental duplications and the much larger size of the human genome.They have not been shown to work with human pangenome data.
Here we developed pangene, a new computational tool to explore the gene content of a pangenome.Unlike bacterial pangenome pipelines, pangene effectively annotates protein-coding genes by aligning protein sequences to each genome with miniprot (Li, 2023).As miniprot can align through in-frame stop codons and frameshifts, this procedure simplifies gene annotation and is robust to insertion/deletion errors in the input genomes.Furthermore, pangene constructs a bidirected gene graph and can capture inversions missed by bacterial pangenome tools.It also provides an algorithm to identify gene copy-number or gene order variations.Pangene is optimized for human genomes and also works for bacterial genomes.The gene behind x; ν(>v) = ν(<v) = v

Methods
Pangene takes a set of protein sequences and multiple genome assemblies as input, and outputs a graph in the GFA format (Li et al., 2020).It involves two steps: aligning the set of protein sequences to each input assembly with miniprot (Li, 2023), and deriving a graph from the alignment with each contig encoded as a walk of genes.Pangene provides utilities to classify genes into core genes that are present in most of the input genomes, or accessory genes otherwise.Pangene can also identify generalized "bubbles" in the graph, which represent local gene order, gene copy-number or gene orientation variations among the input genomes.
Given perfect gene annotation and orthology assignment between the genes, the pangene graph construction algorithm is conceptually simple: it takes an orthologous group as a vertex and adds an edge between two groups if on an input genome, a gene in one group is adjacent to a gene in the other group (Fig. 1).The practical difficulty is to obtain accurate annotation in the presence of redundant sequences, paralogous genes and errors in assembly or alignment.
An important use of pangenome graphs is to identify genomic variations represented as "bubbles".While bubbles have been extensively studied in directed graphs (Onodera et al., 2013), they have not been rigorously defined on bidirected graphs graphs that are the necessary outputs of assemblers and pangenome constructors.If the field is to move towards the more generalized bidirected graph as a representation of genomic variation developing the concept of bubbles in this context will be necessary.Hence, how to define and to find "bubbles" in bidirected graphs will be a major topic of this article.

Defining pangene graphs
Let V be the set of genes and X = V × {>, <} be the set of oriented genes  x = <v, and vice versa.Throughout this article, we will use symbol u, v or w to denote a gene and use x, y or z to denote an oriented gene (Table 1).
Let T be the set of input contigs.They harbor oriented genes in X. 2-tuple (x, y) ∈ X × X is said to be supported by contig t ∈ T if y immediately follows x on t or, due to the strand symmetry of DNA, x immediately follows y.
A pangene graph is usually visualized as Fig. 2a (Wick et al., 2015) and can be mathematically defined in a few equivalent ways.It can be formulated as a directed graph G D = (X, E) where E ⊂ X × X is the set of edges.In the context of pangene graphs, (x, y) is an edge, also written as x → y, if (x, y) is supported by a contig.Notably, if x → y is an edge, y → x must also be an edge.In graph theory, this property is called the skew symmetry.Fig. 2e shows an example of a directed pangene graph which can be described in the GFA format (Fig. 2b).In G D , it is possible that y can reach both x and x.When this happens, x or ν(x) is said to be an inversion.
A pangene graph can also be thought as a bidirected graph G B = (V, E) where the definition of E is the same as above.An edge in E is effectively associated with two directions.For example, (>v, <w) ∈ E, also written as v><w, suggests the edge between v and w has a first direction ">" towards vertex v and a second direction "<" towards vertex w.Fig. 2c shows the bidirected equivalence of Fig. 2a.
In a directed graph, x → y ← z is not a path because x cannot reach z.The similar constraint is applied to the bidirected graph G B .For example, v>>u<<w is not a path because both v and w go into u.v>>u><w is a path, which can also be expressed as a path in G D : (>v, >u, <w).
As pangene constructs the graph from protein-to-genome alignment (Li, 2023), it also attaches alignment scores to edges.Let Tx→y ⊂ T be the set of contigs that support x → y and S(x|y, t), t ∈ Tx→y, be the alignment score of x when x → y is supported by contig t.S(x|y) is calculated as

S(x|y, t)
which is the average score of x among contigs supporting x → y.

Obtaining the protein set
Pangene requires a set of protein sequences as input which may have redundancies.In principle, if every input genome is annotated, we may merge all annotated protein sequences and align to all genomes with miniprot.This strategy works when there are a small number of diverged genomes.If there are many similar genomes, all with annotations, a faster way is to cluster protein sequences with CD-HIT (Li and Godzik, 2006) or MMseqs2 (Schneider et al., 2017), and take a representative protein sequence from each cluster to generate the protein set.
Most human genomes do not have high-quality gene annotation.We may generate the protein set from the reference gene annotation supplemented with protein-coding genes or diverged alleles not present in the reference genome, such as HLA-DRB3.Pangene allows one gene to be associated with multiple protein sequences.If protein names follow format <GeneID>:<ProteinID>, pangene will select one protein for each <GeneID> and use <GeneID> as the segment name in the output GFA file.

Selecting non-orthologous genes
The central problem in graph construction is how to consistently annotate multiple genomes with a subset of input proteins.A naive way to annotate a genome is to align all protein sequences to the genome and if there are multiple proteins mapped to the same locus, select the best scoring protein as the annotation.This simple strategy does not work if there are orthologous proteins in the input protein set.For example, suppose gene v and w are orthologous to each other and are thus aligned to the same locus in each genome.When some genomes match v better and the rest match w better, both v and w will be considered as accessory genes that are not present in all genomes.However, the preferred choice is to elect one of them as a core gene.
Pangene selects non-orthologous genes as follows.For gene v, pangene inspects its best hit in each genome (hashes of gene names are used to break ties) and counts b(v) the number of genomes where v is aligned better than all other genes overlapping with v. Pangene then traverses each gene v in the descending order of current b(v).If b(v) > 0, select v as a non-orthologous gene and for each genome where the best mapping of w overlaps with v but is aligned better than v, reduce b(w) by one.This way, if v and w are orthologs, their best mapping positions likely overlap and thus only one of them will be selected; if v and w are paralogs, their best mapping positions usually differ and thus both will be selected.

Adjusting pangene graphs
Pangene constructs an initial graph using genes selected from the last section.It applies additional heuristics to improve gene annotations based on the graph topology and alignment scoring in two special cases.
First, a gene may be wrongly mapped to its paralogs if its best position is missing.Suppose a gene on chromosome X has a second best hit on an autosome.If the alignment score of the autosomal hit is lower than the hit to chromosome X by 3% miniprot will only report the hit on chromosome X.However, because a phased male human assembly does not have chromosome X, the gene will be mapped to the autosome (Fig. 3a).This false hit will lead to spurious edges in the initial graph (Fig. 3b).To solve this problem, pangene marks x → y as a false edge if there exists x → z such that y and z are on different contigs in every input genome and S(x|y) < S(x|z) • r1 (r1 defaults to 95%).Pangene then filters out all alignment of x that are incident to a false edge x → y (Fig. 3c).
Second, similar to the first case, if the best position of a gene is missing due to a real deletion in evolution, this gene may interfere with the alignment of paralogous genes (Fig. 3d) and leads to a graph (Fig. 3e) inconsistent with manuscript annotation (Liao et al., 2023).To improve this case, pangene marks x → y to be low priority if there exists x → z such that y and z are on the same contig in an input genome and S(x|y) < S(x|z)•r2 (r2 defaults to 98%).Pangene then marks all x alignment that are incident to a low-priority edge x → y to have low priority as well and prefers a gene without the low-priority mark.In Fig. 3e, x = <CYP2D6, y = <TCF20 and z = <CYP2D7.<CYP2D6 is marked to have low priority on the second contig.This results in a graph in Fig. 3f that is more consistent with the known evolution at this locus (Liao et al., 2023).
Both cases may also be caused by processed pseudogenes.When a protein-coding gene has a recent pseudogene that has a near identical sequence, the two heuristics may not work due to the score ratio threshold.By default, pangene filters out an alignment without splicing as a potential pseudogene alignment if the protein has a spliced alignment in other samples.Pangene also has an option to drop a protein if it does not have spliced alignment in any sample.For human samples, these strategies can reduce false connections between chromosomes.

Past work on bubble finding
The concept of bubble is perhaps first coined by Zerbino and Birney (2008) in the context of assembly graphs.A bubble was initially meant to be simple in that paths through a bubble do not share vertices except the start and the end vertices.Onodera et al. (2013) introduced superbubble to allow more complex acyclic topology and found a quadratic algorithm to identify all superbubbles.The time complexity was improved to O(|E| log |E|) by Sung et al. (2015) and then to linear by Brankovic et al. (2016).
A subgraph induced by a superbubble is acyclic, but in a pangene graph or a variation graph, we also care about cyclic subgraph.Gärtner et al. (2018) defined weak superbubble without the acyclic condition, and presented a linear algorithm to identify all weak superbubbles (Gärtner and Stadler, 2019).Interestingly, more than 20 years before this work, Johnson et al. (1994) had already come up with a linear algorithm to effectively find weak superbubbles in the context of compiler design, which will be the basis of our algorithm.Weak superbubble is defined on directed graphs, not on bidirected graphs.If we apply the previous algorithms to directed graph G D , we will find most superbubbles twice on opposite strands.More importantly, superbubble cannot capture the visual intuition in the presence of inversions.For example, in Fig. 4a, >A and <C enclose a subgraph that looks like a bubble, but the equivalent directed graph (Fig. 4f) does not have a superbubble when >A and <A are taken as different vertices.Dabbaghie et al. (2022) implemented the algorithm by Onodera et al. (2013) for sequence graphs, but how the issues above are handled is not apparent.The closest concept to weak superbubble in bidirected graphs is snarl (Paten et al., 2018), which leads to a bidirected subgraph that can be separated from the rest of the graph.Snarl does not require reachability which is necessary in defining superbubbles.To this end, there are no equivalent definition of weak superbubble in directed graphs.

Defining generalized bibubbles
The definition of "bubble" here tries to capture the visual intuition.It is inspired by Onodera et al. (2013).Let U (x, y) ⊂ V be the set of vertices that are reachable from x without passing through x, x or y.A generalized bidirected bubble, or a generalized bibubble in short, is a pair (x, y) ∈ X × X such that i) U (x, y) = U (y, x) ̸ = ∅; ii) ∀v ∈ U (x, y), there is a walk from x to y that passes v; iii) there does not exist z ∈ U (x, y) × {>, <} that satisfies U (x, z) = U (z, x) or U (z, y) = U (y, z).Due to the skew symmetry, if (x, y) is a generalized bibubble, (y, x) is also a generalized bibubble.
In the definition of U (x, y), we intentionally allow walks passing through y; otherwise U (>A,<C) in Fig. 4g would become {B}, making (>A,<C) a generalized bibubble.This example also suggests (x, y) would not be a generalized bibubble if x ∈ U (x, y) or y ∈ U (x, y).Condition ii) above aims to exclude examples such as Fig. 4h.Condition iii) requires a bibubble to be minimal.As a result, it is not possible for (x, y) and (x, z) both to be generalized bibubbles unless y = z.Generalized bibubbles are also nested in that if (x, x ′ ) and (y, y ′ ) are generalized bibubbles and U For a generalized bibubble (x, y), let U (x, y) = V \ U (x, y) \ {ν(x), ν(y)}.No vertex in U (x, y) can reach vertices U (x, y) because otherwise w ∈ U (x, y) reachable from x (or from y) would not pass through y (or x) and would be added to U (x, y) (or U (y, x)) -this would violate the definition of U (x, y).This observation indicates that removing ν(x) and ν(y) would separate U (x, y) from the rest of the graph.
A naive algorithm to identify all generalized bibubbles is to enumerate each pair of oriented genes and then test whether the pair forms a generalized bibubble based on the definition.This is an O(|V | 2 • (|V | + |E|)) algorithm, impractical for large graphs.To describe a new algorithm for bubble finding, we will first introduce net graphs and cycle equivalence.

Biedged graphs and net graphs
Let Z = V × {s, e}, where symbol "s" represents the start of a gene and symbol "e" represents the end.Given a bidirected graph (v, e)}|v ∈ V and E l being the set of undirected connections between starts or ends of genes in Z (Paten et al., 2018).Fig. 2d shows an example.Note that a valid walk in a biedged graph cannot pass two consecutive edges in E l .Due to this restriction, many classical algorithms in graph theory are not applicable to a biedged graph.
A net graph G N = (P, En) is constructed by contracting all edges in E l .As a result, a vertex in P is a component connected by edges in E l and an edge represents a gene in V (Fig. 2e and Fig. 4e).Because En and V has one-to-one relationship, we will also use v to denote an edge in G N .A net graph is an undirected graph and is not equivalent to the bidirected graph G B .For example in Fig. 2a, there is not a walk that goes through both gene C and D, but in Fig. 2f, there is a walk involving both genes.
In graph theory, a set of edges in

Cycle equivalence
Two edges e1 and e2 in G N are said to be cycle equivalent if every cycle containing e1 contains e2 and vice versa.Cycle equivalent edges form equivalent classes, which can be determined in linear time (Johnson et al., 1994).
For any pair of e1 and e2 that are cycle equivalent, {e1, e2} is a cut-set.To see this, let p11 and p12 are the two vertices incident to e1.Without losing generality, suppose Gn is a connected graph.If Gn remained connected after removing e1 and e2, there would exist a vertex q that can reach both p11 and p12 without passing through e2 (because e2 has been removed).Then e1 would be in a cycle with q in the original G N but without passing through e2, violating the cycle equivalence of e1 and e2.
Conversely, if {e1, e2} is a cut-set, e1 and e2 are cycle equivalent.To prove this, let {e1, e2} partition the vertex set P into P1 and P2.If e1 and e2 were not cycle equivalent, there must exist a cycle that only contains e1 but not e2.Then cutting both e1 and e2 would still keep P1 and P2 connected due to this cycle, violating that {e1, e2} is a cut-set.
To this end, {e1, e2} is a cut-set if and only if e1 and e2 are cycle equivalent.Then if (x, y) is a generalized bibubble in G B , ν(x) and ν(y) are cycle equivalent in G N .The reverse is not always true.For example, in Fig. 2f, "C" and "D" are cycle equivalent, but (>C,>D) or in any orientation is not a generalized bibubble.Similarly, "A" and "B" are cycle equivalent in Fig. 5c but they do not enclose generalized bibubbles.

Identifying generalized bibubbles
Given a bidirected graph G B , pangene transforms it to a net graph G N and assigns a class number to each gene v such that v and w are in the same class if they are cycle equivalent in G N .For each oriented gene x that has multiple edges, pangene applies a breadth-first

Results
We applied pangene to HPRC samples and manually inspected subgraphs around CYP2D6, RHD, C4, HLA-DRB1 and KIR genes which have been curated for these samples (Liao et al., 2023;Zhou et al., 2024).We also looked at many other genes that are known to have gene-level variations (Sudmant et al., 2010;Boettger et al., 2012;Handsaker et al., 2015).While these genes have not been manually annotated for HPRC samples, we could check whether pangene captured known events such as the MAPT inversion and the ORM1 copy-number changes.
Due to the lack of genome-wide ground truth in human data, we could only manually evaluate pangene in individual subgraphs.To get a better sense of overall statistics, we additionally applied pangene to two bacterial datasets and compared the results to existing bacterial pangenome tools.

Structural variants between two human genomes
We downloaded GenCode comprehensive human annotation v45, retained the protein-coding transcripts tagged with "Ensembl canonical" and filtered out readthrough transcripts (a transcript that joins two adjacent genes) and mitochondrial genes.Only one transcript was selected per gene.Because GenCode does not include HLA-DRB3, HLA-DRB4 and several KIR genes, we manually added 20 HLA-DRB genes and alleles from the IPD-IMGT/HLA database and 15 KIR genes from the IPD-KIR database.In the end, we constructed a protein set with 19,421 sequences.
We aligned the proteins to the human reference genome GRCh38 (Schneider et al., 2017) and T2T-CHM13 (Nurk et al., 2022) with miniprot (Li, 2023) v0.12 under option "--outs=0.97-Iu" and constructed a pangene graph.The resulting graph contains 19,088 genes and 19,390 edges.We identified 91 generalized bubbles in the graph that contain 100 genes or fewer.The bibubbles involved 691 genes affected by gene copy, gene order or gene orientation changes between the two genome.These genes included many known events such as PDPR, SMN2, CTAGE9, HPR, ORM1, CCL4, NCF1 and Amylase (Handsaker et al., 2015;Sudmant et al., 2010).Some of the large gene clusters affected by structural changes, such as SMN2 and Amylase, would not be easily captured by wholegenome alignment as they are not represented by colinear genome alignments.

Analyzing 100 human haplotypes
We obtained the assembly of CN1 (Yang et al., 2023), YAO (He et al., 2023) and 47 samples released by the Human Reference Pangenome Consortium (Liao et al., 2023).Together with GRCh38 and T2T-CHM13, this gave us 100 human haplotypes.We aligned the same set of proteins to the 100 haplotypes.Pangene took less than one minute to construct a graph and less than one second to identify 266 generalized bibubbles in the graph.Many of the bibubbles were supported by one contig only.Pruning edges supported by one contig resulted in a graph containing 209 generalized bibubbles.Further dropping about 2,000 single-codingexon genes led to a graph containing 163 bibubbles.These included manually confirmed gene-level variations around CYP2D6, C4 and RHD (Liao et al., 2023) and around HLA-DBR1 (Zhou et al., 2024).
Pangene effectively annotates each input genome.For a sanity check, we compared the pangene gene annotation to the GENCODE annotation.Out of 19,043 protein coding genes in both annotations, only four gene locations are different.In one example, pangene assigned NUDT4B to NUDT4.Pangene misses NUDT4B because it is a single-exon gene, and pangene puts NUDT4B at the NUDT4 locus because the protein sequence of NUDT4 is a strict substring of NUDT4B.
We also compared the pangene annotation to the T2T-CHM13 annotation generated by Ensembl.There are 41 differences out 18,676 genes present in both annotations.These include four genes CSAG2, CSAG3, MAGEA6 and MAGEA3 that are close to each other.Pangene gives better protein-to-genome alignment for the four genes.We speculate that Ensembl may be annotating these genes based on synteny with GRCh38, but synteny here might be broken due to an apparent inversion around the genes.For another example, in the RCCX gene cluster (Carrozza et al., 2021), pangene annotates CYP21A2 ahead of TNXB but Ensembl puts CYP21A2 inside an intron of TNXB, which seems to be an error.Ensembl also misses C4A, another gene in RCCX.We believe pangene is doing better in these cases.Most other genes among the 41 differences come from huge complex gene clusters.We cannot tell what the correct annotation is.
We next investigated genes with presence/absence variants (PAVs).Out of 16,996 multi-exon genes in the graph, 16,315 were on the autosomes of GRCh38.3.4% of them were absent from one or more haplotypes and 0.4% (n=58) were only present in ≤50% of haplotypes.Among the 58 genes, TRIM64, for example, is close paralogous to TRIM64B.On most haplotypes, TRIM64B was aligned better than TRIM64 and pangene annotated two TRIM64B genes but no TRIM64.Therefore TRIM64 became a PAV.When we excluded genes having paralogs of ≥95% identity from the list, only 14 were left.This list included three KIR genes and HLA-DRB4 that are known to be PAVs (Zhou et al., 2024).We checked a few remaining genes in the list.PRH1 and GSTM1 were in relatively simple regions and look like real PAVs.Both of them have paralogs of identity below 95%.GSTT2 has a close paralog GSTT2B that CD-HIT failed to identify -it seemed an algorithm glitch.The flanking region around NOTCH2NLR looked complex and was poorly assembled in most haplotypes.This may explain the low occurrence rate of this gene.Overall, almost all human genes or their close paralogs are present in the majority of human haplotypes.It is rare to see a protein-coding gene with sequences completely missing without paralogs as backups.

Incorporating 10 great ape haplotypes
We also constructed a graph including ten haplotype assemblies of great apes, including chimpanzee, bonobo, gorilla and two orangutan subspecies (Makova et al., 2023).We filtered out singleexon genes and pruned edges supported by one contig.This graph contained 239 generalized bibubbles, 131 of which were polymorphic among the 100 human samples.The fewer polymorphic bibubbles in human (in comparison to 163 identified without great ape samples) were caused by the merging of small bibubbles, by the increased difficulty in resolving paralogs given more remote outgroups, or by chromosome-scale inversions or translocations that disrupted the local bubble topology.The last cause seemed to be the major Panaroo, pangene and PPanGGOLiN were applied to two sets of gapless bacterial assemblies: 152 M. tuberculosis (Mtb) strains and 50 E. coli strains.Mtb genes were annotated by Prokka and Ecoli genes were annotated by NCBI and downloaded from GenBank.A "core" gene is a gene that is inferred to be present in ≥99% of assemblies in each dataset.Panaroo was invoked in the strict mode with paralogs merged (option "--clean-mode strict --merge paralogs").
contributor.We will use the 17q21.31inversion around MAPT as an example.The 17q21.31 inversion is polymorphic in the human population (Boettger et al., 2012;Steinberg et al., 2012).In the graph derived from the 100 human haplotypes, pangene identifid a generalized bibubble corresponding to this inversion.The bibubble contained LRRC37A and LRRC37A2 which have similar sequences.The two genes have another paralog LRRC37A3 that is ∼18 Mb away.All great apes only have one copy of the gene.Intriguingly, gorilla haplotypes are similar to human haplotypes at 17q24.1 but chimpanzee and orangutan haplotypes involve genes at 17q21.31 (Fig. 6).As a result, the pangene graph including great apes connected genes in 17q24.1 and 17q21.31and broke the 17q21.31bibubble in the human-only graph.Involving outgroup species may hurt the finding of localized bibubbles.

Analyzing 152 M. tuberculosis strains
We downloaded the M. tuberculosis reference strain H37Rv and its gene annotation from RefSeq (AC:GCF 000195955.2) and obtained the complete long-read assemblies of 151 other strains (Peker et al., 2021;Marin et al., 2022;Hall et al., 2023).Following the instruction of Panaroo (Tonkin-Hill et al., 2020), we ran Prodigal (Hyatt et al., 2010) v2.6.3 on the reference strain to train the Prodigal model and ran Prokka (Seemann, 2014) v1.14.6 with the pretrained model to predict protein coding genes in all strains.We used CD-HIT (Li and Godzik, 2006;Fu et al., 2012) v4.8.1 with option "-c 0.98" to cluster non-reference protein sequences, which resulted in 6,744 clusters.We mapped these proteins to each M. tuberculosis genome using miniprot with option "-S" to disable splicing.We finally ran pangene with "-p.001" to keep all genes regardless of their frequency in the pangenome.
Pangene constructed a graph consisting of 4,216 genes, 3,652 of which were present in all 152 genomes (Table 2; Fig. 7).To check if pangene captured the gene content in these strains, we compared the pangene result to Panaroo v1.3.4.We aligned the Panaroo proteins to the pangene proteins with MMseqs2 (Steinegger and Söding, 2017) v13.45111 and identified 46 Panaroo proteins do not hit to pangene proteins at the default E-value threshold of 10 −3 .We mapped the 46 proteins to H37Rv with miniprot and found 43 of them can be aligned and 76% of the aligned regions overlap with annotated CDS in RefSeq.Manually investigating the overlaps revealed that most of the 43 proteins were aligned to the opposite strand of some RefSeq genes or in different reading frames.Identifying homology based on the genomic locations of input proteins, pangene did not include them into the final graph.Conversely, 40 pangene proteins do not hit to Panaroo proteins.The diagram shows the number of clusters in each intersection.MMseqs2 merges paralogous genes, so the number of clusters for each tool is smaller than the number of genes in Table 2. Venn diagrams were generated with BioVenn (Hulsen et al., 2008).
We additionally ran PPanGGOLiN (Gautreau et al., 2020) v1.2.105 with the Prokka annotation as the input.PPanGGOLiN collected 4,744 genes in the pangenome with 3,463 present in all.277 genes did not hit to genes selected by pangene.274 of these genes could be aligned to H37Rv by miniprot and 90% of the aligned regions overlap with annotated CDS in RefSeq.We note that although overlapping genes in different open reading frames are rare in H37Rv (0.2% of coding regions), these genes may occur in other strains and may have functional implications (Vargas et al., 2023;Snobre et al., 2024).PPanGGOLiN's preference to count such genes in the pangenome is not necessarily a weakness.

Analyzing 50 E. coli strains
M. tuberculosis has low diversity with each strain similar to each other (Marin et al., 2022).To understand how pangene performs given more diverse strains, we collected 50 E. coli genomes with complete assemblies (Shaw et al., 2021).
We followed the same procedure to run pangenome tools.To get clean graph, pangene by default filters out genes that had >10 edges or connected >3 distant loci in the graph.This filter only removed six genes in M. tuberculosis dataset but it filtered out 848 genes in E. coli.We added option "-p.001 -g50 -r10" to retain more high-copy genes and genes connecting multiple loci.Although pangene collected fewer genes (Table 2), only 72 Panaroo genes do not hit to genes collected by pangene and the pangene pangenome size after clustering is not noticeably smaller than the Panaroo and the PPanGGOLiN pangenomes (Fig. 7).The differences between the tools might be determined by subtle thresholds on how to resolve homology and may not reflect the capability of each algorithm.

Discussions
Pangene constructs a graph to jointly represent the gene content of multiple high-quality genomes.It provides approximate gene annotation and orthology assignment as a byproduct.Given bacterial genomes, it plays a similar role to other bacterial pangenome tools for identifying core and accessory genes (Tonkin-Hill et al., 2023).Pangene is unique in that it also works for eukaryotic genomes and captures localized gene-level variants.Pangene complements whole-genome alignment based pangenome tools and gives a more concise high-level view of gene content changes.
Constructing a pangene graph is challenging because the biologically correct graph is unknown.We tend to think a "correct" graph should model the evolution history in principle.However, evolution is complex in the presence of convergent/divergent evolution in the long time scale and non-homologous recombinations and non-allelic gene conversions (Vollger et al., 2023) in the short time scale.We do not know the evolution history of most complex loci except several well-studied cases.Furthermore, we take each gene as an atomic unit but biologically, a gene may be subjected to structural changes such as losing/gaining exons or fusing with another gene.These events are difficult to model even if we had the exact annotation in all input genomes.
The pangene graph construction algorithm is ad hoc.We have manually tuned various heuristics to correctly encode known genelevel variants in human, but may have overlooked other cases that pangene may fail to represent.Ideally, we would prefer to model graph construction to a global optimization problem.We have not been able to find such a formulation that can reliably encode known variations.This limitation is not specific to pangene; it is also applied to other pangenome construction tools.
As protein-to-genome alignment can identify orthology within mammals or even vertebrates (Li, 2023), we could in principle apply pangene to more distantly related species.We did construct a gene graph from the human and mouse reference genomes.The graph revealed synteny blocks visible from the human-mouse whole-genome alignment, but the topology of the whole graph looked complex due to frequent large-scale rearrangements over long evolutionary distance.Many generalized bibubbles identified from this graph were due to real orthologous alignments missed by miniprot.We will need another way at higher level to investigate such a graph.Another practical challenge in building a cross-species graph is to select the set of input proteins.When constructing the great ape graph, we tried to merge human and great ape protein sequences.In the resultant graph, we observed thousands of alignments in the intergenic or intronic regions of GRCh38 which degraded the overall quality of the graph.The issue will become more prominent when we take diverse genomes without high-quality gene annotation as input.Nevertheless, we note that cross-species whole-genome alignment is also challenging and is perhaps no better than pangene.We still believe with improvements, pangene can construct a cross-species graph that helps to reveal evolution in longer term.
On the theoretical side, this article presented a rigorous definition of "bubble" in a bidirected graph but it did not find an efficient algorithm to identify such generalized bibubbles.While the current implementation in pangene works for gene graphs containing ∼20,000 genes, it will be slow for a minigraph-cactus or PGGB graph that contains tens of millions of nodes.How to efficiently identify generalized bibubbles remains an open and critical problem.Furthermore, "bubbles" cannot represent rearrangements at the chromosome scale.How to explore a multi-species gene graph or an alignment graph in general may be another interesting research topic.

Fig. 2 .
Fig. 2.An example of a gene graph.(a) A gene graph for visualization.(b) The corresponding GFA format.(c) Bidirected graph.(d) Biedged graph.(e) Directed graph.(f) Net graph, which loses information.(g) A depth-first traversal of the net graph with "0" representing a super node not in the original net graph.(a) through (e) are equivalent representations.

Fig. 3 .
Fig. 3. Graph-based annotation adjustment.(a) MSL3 is located on chromosome X but has a weaker hit on chromosome 2 when the gene is aligned to a male haplotype without chromosome X.(b) The initial pangene graph.Percentages indicate alignment identity.(c) The adjusted graph after filtering out the MSL3 hit on chr2.(d) Two haplotypes involving CYP2D6 and CYP2D7 which are paralogous genes.The second haplotype should have CYP2D7 only but because CYP2D6 is longer, the CYP2D6 alignment score is higher than the CYP2D7 score.(e) The initial pangene graph assuming the second haplotype has CYP2D6.(f) The adjusted graph.

Fig. 5 .
Fig. 5. Example of gene graphs that complicate net graph based bubble finding.(a) Gene annotation on a contig.(b) The corresponding pangene graph.(c) Net graph with the dummy vertex "0" connected to "1" only.(d)Net graph with the dummy vertex connected to genes at the ends of the contig.
search (BFS) to traverse up to m genes (100 by default) reachable from x.During BFS, pangene collects y in the same equivalent class and does not traverse beyond y.It then checks each (x, y) using the definition of generalized bibubble.Because x can only appear in one generalized bibubble, pangene stops testing other (x, •) pairs if it has found a generalized bibubble.The time complexity in the worst case is O(|V | + |E| + b • m 2 ), where b is the number of genes with multiple incoming or outgoing edges and m is effectively the maximum bibubble size.On real pangene graphs, the time complexity is close to O(|V | + |E| + b • m).It takes pangene less than one second to find ∼200 generalized bibubbles in a human graph including 20 thousand genes.In addition to cycle equivalence, Johnson et al. (1994) described another linear algorithm to find single-entry-single-exist (SESE) regions in G N provided that G N contains a vertex p0 that puts every other vertex in G N in a cycle with p0.A SESE region is intuitively a bubble-like subgraph in G N .Pangene also implemented this algorithm.It introduced p0 that connects all vertices with degree one in G N and connects the first and the last gene on each reference chromosome (Fig. 2f and Fig. 5d).Because a SESE region does not always correspond to a generalized bibubble in G B , pangene still has to test each SESE region using the definition of generalized bibubble.The time complexity of this second algorithm is O(|V | + |E| + b ′ m 2 ) where b ′ is the number of SESE regions in G N .It is faster in theory, but pangene still uses the first algorithm by default as the placement of p0 is heuristic which caused one false negative in a human pangene graph.

Fig. 6 .
Fig. 6.Human and great ape haplotypes around LRRC37A* genes.A red vertical bar indicates the two genes on each side are not adjacent on the genome.In the plot, each genome contains two blocks of genes.The genomic distance between the two blocks is 17.3 Mb in human, 61.3 Mb in bonobo, 17.6 Mb in gorilla or 2.8 Mb in orangutan.GRCh38 has LRRC37A and LRRC37A2 in 17q21.31and LRRC37A3 in 17q24.1.HG00741 has a known inversion around MAPT.All great apes have only one LRRC37A* gene.Chimpanzee and bonobo have the same gene haplotype.The two orangutan subspecies also have the same haplotype.The LRRC37A3 alignment in gorilla has two in-frame stop codons.It is challenging to distinguish LRRC37A* paralogs as they are similar in sequence and lack synteny with nearby genes.

Fig. 7 .
Fig. 7.Comparing three bacterial pangenome construction tools.In each venn diagram, pangene, Panaroo and PPanGGOLiN proteins are merged and clustered with MMseqs2 (easy-cluster -c 0.9 --min-seq-id 0.5).The diagram shows the number of clusters in each intersection.MMseqs2 merges paralogous genes, so the number of clusters for each tool is smaller than the number of genes in Table2.Venn diagrams were generated with BioVenn(Hulsen et al., 2008).

Table 2 .
Number of detected genes in bacterial pangenomes